Predicting Pulsar Stars (group 26 project proposal)¶

Introduction¶

As technology evolves, classification methods have gained popularity in many fields. As we progress through our Data Science 100 course, our group has decided to deepen our understanding of classification problems. To do this, we have chosen a dataset related to pulsar stars for our research.

Fundamentally, a pulsar is a type of neutron star that emits energy detectable by current technology (1). These measurements help astronomers and astrophysicists test theories and models, for instance Albert Einstein's theory of general relativity. However, given the limitations of our technology, these signals may be misidentified: a collected signal may not indicate the presence of a pulsar at all, but instead be a human-produced radio signal reflected back from space. The dataset contains nine columns, the first eight of which hold the following variables:

  • Mean of the integrated profile.
  • Standard deviation of the integrated profile.
  • Excess kurtosis of the integrated profile.
  • Skewness of the integrated profile.
  • Mean of the DM-SNR curve.
  • Standard deviation of the DM-SNR curve.
  • Excess kurtosis of the DM-SNR curve.
  • Skewness of the DM-SNR curve.

The legitimacy of a pulsar signal is recorded in the ninth column, which contains either “0” or “1”: “0” identifies non-pulsar signals, while “1” identifies legitimate pulsar signals (2).

In this report, we will use classification methods to answer our research question: can we determine whether a given observation is a pulsar, and which predictors contribute most to that prediction?

Preliminary exploratory data analysis¶

1. Loading necessary libraries and initialization¶

In [1]:
library(tidyverse)
library(tidymodels)
library(reshape2)
install.packages("kknn")

set.seed(2023) # a constant seed to keep the analysis reproducible

2. Reading data from web¶

In [2]:
# read the data and assign column names
pulsar_data <- read_csv("https://raw.githubusercontent.com/sinamhdv/DSCI-100-project-group-26/main/HTRU_2.csv", col_names = FALSE)
colnames(pulsar_data) <- c("profile_mean", "profile_stdev", "profile_skewness", "profile_kurtosis", "dm_mean", "dm_stdev", "dm_skewness", "dm_kurtosis", "class")
head(pulsar_data, n = 6)
Rows: 17898 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): X1, X2, X3, X4, X5, X6, X7, X8, X9

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 6 × 9

profile_mean  profile_stdev  profile_skewness  profile_kurtosis  dm_mean   dm_stdev  dm_skewness  dm_kurtosis  class
   140.56250       55.68378       -0.23457141        -0.6996484  3.199833  19.11043     7.975532     74.24222      0
   102.50781       58.88243        0.46531815        -0.5150879  1.677258  14.86015    10.576487    127.39358      0
   103.01562       39.34165        0.32332837         1.0511644  3.121237  21.74467     7.735822     63.17191      0
   136.75000       57.17845       -0.06841464        -0.6362384  3.642977  20.95928     6.896499     53.59366      0
    88.72656       40.67223        0.60086608         1.1234917  1.178930  11.46872    14.269573    252.56731      0
    93.57031       46.69811        0.53190485         0.4167211  1.636288  14.54507    10.621748    131.39400      0

You can see the first 6 rows of the dataset above.

3. Cleaning and wrangling data¶

The class column of the initial dataset has a numeric type (dbl). However, this is a binary classification problem and the class can only be 0 or 1, so we will convert this column to a factor.

In [3]:
# change class to factor
pulsar_data <- pulsar_data |>
    mutate(class = as.factor(class))

4. Splitting train and test data¶

We split the dataset into training and testing sets, using 75% of the observations for training and the rest for final testing.

In [4]:
# train-test split
pulsar_split <- initial_split(pulsar_data, prop = 0.75, strata = class)
pulsar_train <- training(pulsar_split)
pulsar_test <- testing(pulsar_split)

5. Summarizing training data (exploratory data analysis)¶

We will summarize the data by calculating the mean of every numeric variable for each class and counting the observations in each class. Then we will count the number of missing entries in each column; all counts are zero, as the second table below shows:

In [5]:
# summarize data
pulsar_train |>
    group_by(class) |>
    summarize(class_count = n(), across(profile_mean:dm_kurtosis, mean))

summarize(pulsar_train, across(profile_mean:dm_kurtosis, ~ sum(is.na(.x))))
A tibble: 2 × 10

class  class_count  profile_mean  profile_stdev  profile_skewness  profile_kurtosis  dm_mean    dm_stdev  dm_skewness  dm_kurtosis
0            12195     116.63791       47.35326         0.2093131         0.3776953   8.744286  23.28152     8.847885    112.93407
1             1228      56.88547       38.81569         3.1128235        15.3991441  49.928534  56.14818     2.787913     18.58492

A tibble: 1 × 8

profile_mean  profile_stdev  profile_skewness  profile_kurtosis  dm_mean  dm_stdev  dm_skewness  dm_kurtosis
           0              0                 0                 0        0         0            0            0

According to the first table, there are far fewer positive-class observations than negative-class observations. Therefore, we may need to balance the data by upsampling to prevent this imbalance from affecting our model. Moreover, according to the second table, there are no missing values in the dataset.

6. Visualizing training data (exploratory data analysis)¶

We will draw a histogram of the distribution of each of the numeric columns in our data set.

In [6]:
# visualize the data
options(repr.plot.width = 8, repr.plot.height = 8)

# reshape to long format so each variable can be plotted in its own facet
long_data <- pulsar_train |>
    select(profile_mean:dm_kurtosis) |>
    pivot_longer(everything(), names_to = "column_name", values_to = "value")

# draw a histogram for each numeric variable
hist_plots <- ggplot(long_data, aes(x = value)) +
    geom_histogram() +
    facet_wrap(~ column_name, scales = "free") +
    ggtitle("Figure 1: Histograms of variable distributions")

hist_plots
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Looking at the histograms above, profile standard deviation and profile mean have roughly bell-shaped, approximately normal distributions (with negligible skewness). The other predictors, however, do not appear normally distributed: for DM-SNR mean, most values are very small and close to 0, but some are as large as about 200.
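This heavy right skew can also be checked numerically. As a quick sanity check (a sketch, not part of the original analysis), we could summarize the spread of the DM-SNR mean column:

```r
# quick numeric check of the right skew in dm_mean:
# a median far below the maximum indicates most values sit near 0
# with a long right tail
pulsar_train |>
    summarize(min_dm = min(dm_mean),
              median_dm = median(dm_mean),
              max_dm = max(dm_mean))
```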

Now we will use a correlation heatmap to visualize the absolute value of correlation between every pair of our variables. We are particularly interested in the correlations between the class column (target variable) and other columns, because this can help us choose our predictor variables.

In [7]:
# draw correlation heatmap
options(repr.plot.width = 8, repr.plot.height = 8)

# create a correlation matrix
correlation_matrix <- pulsar_train |>
    mutate(class = as.numeric(class)) |>
    cor()

# change the matrix to dataframe and change values to absolute values
cor_df <- melt(correlation_matrix) |>
    mutate(value = abs(value))

ggplot(cor_df, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  labs(title = "Figure 2: Correlation heatmap") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

According to the correlation heatmap above, profile_kurtosis, profile_skewness, and profile_mean have the largest absolute values of correlation with the target class. Therefore, we will be using these variables as predictors in our model.

Data Analysis¶

We will use the K-nearest neighbors algorithm to train a model to answer our predictive question.

1. Creating a data preprocessing recipe¶

We will create a recipe to standardize (center and scale) our predictor variables, so that differences in variability across predictors do not make any single predictor more influential in the K-nearest neighbors algorithm.

In [8]:
pulsar_recipe <- recipe(class ~ profile_kurtosis + profile_skewness + profile_mean, data = pulsar_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

2. Preparing the model specification for cross-validation¶

In [9]:
pulsar_vfold <- vfold_cv(pulsar_train, v = 5, strata = class)

k_values <- tibble(neighbors = seq(from = 1, to = 50, by = 5))

knn_spec_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

3. Running cross-validation¶

In order to determine which value of K to use for the final model, we run cross-validation on the training data with 5 folds, for values of K from 1 to 46 in steps of 5. (Note: the dataset is fairly large, so expect this block to take roughly two minutes to run, as measured on the UBC Open Jupyter server.)

In [10]:
tuning_metrics <- workflow() |>
    add_recipe(pulsar_recipe) |>
    add_model(knn_spec_tune) |>
    tune_grid(resamples = pulsar_vfold, grid = k_values) |>
    collect_metrics()

tuning_accuracies <- tuning_metrics |>
    filter(.metric == "accuracy")

tuning_accuracies
A tibble: 10 × 7

neighbors  .metric   .estimator  mean       n  std_err       .config
        1  accuracy  binary      0.9650601  5  0.0008435235  Preprocessor1_Model01
        6  accuracy  binary      0.9781719  5  0.0015379182  Preprocessor1_Model02
       11  accuracy  binary      0.9782464  5  0.0014255390  Preprocessor1_Model03
       16  accuracy  binary      0.9783955  5  0.0013828856  Preprocessor1_Model04
       21  accuracy  binary      0.9780974  5  0.0013699906  Preprocessor1_Model05
       26  accuracy  binary      0.9782463  5  0.0013197287  Preprocessor1_Model06
       31  accuracy  binary      0.9782464  5  0.0014828539  Preprocessor1_Model07
       36  accuracy  binary      0.9779483  5  0.0016292682  Preprocessor1_Model08
       41  accuracy  binary      0.9777248  5  0.0014728424  Preprocessor1_Model09
       46  accuracy  binary      0.9775013  5  0.0015382355  Preprocessor1_Model10

The table above shows the estimated mean of the accuracy value of the model for each value of K.

4. Visualizing the result of cross-validation¶

We use a line plot to visualize the mean accuracy estimates for different values of K.

In [11]:
knn_plot <- ggplot(tuning_accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    xlab("value of K") +
    ylab("Mean accuracy") +
    ggtitle("Figure 3: Plot of cross-validation results")

knn_plot

According to this plot, K = 16 leads to the highest accuracy estimate. Choosing this value of K should help keep our model from being either overfit or underfit. Therefore, we extract this optimal K value below and use it to build our final model.

In [12]:
optimal_k <- slice_max(tuning_accuracies, mean, n = 1) |>
    select(neighbors) |>
    pull()

optimal_k
16

5. Building the final model¶

We use the optimal value of K found through cross-validation to build a final model specification and then train the model on training data.

In [13]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = optimal_k) |>
    set_engine("kknn") |>
    set_mode("classification")

knn_fit <- workflow() |>
    add_recipe(pulsar_recipe) |>
    add_model(knn_spec) |>
    fit(data = pulsar_train)

6. Final testing¶

We use the final model to make predictions on the testing data, then compute the model's accuracy and confusion matrix to report its performance.

In [14]:
pulsar_predictions <- knn_fit |>
    predict(pulsar_test) |>
    bind_cols(pulsar_test)

pulsar_metrics <- pulsar_predictions |>
    metrics(truth = class, estimate = .pred_class) |>
    filter(.metric == "accuracy")

pulsar_conf_mat <- pulsar_predictions |>
    conf_mat(truth = class, estimate = .pred_class)

pulsar_metrics
pulsar_conf_mat
A tibble: 1 × 3

.metric   .estimator  .estimate
accuracy  binary      0.9798883
          Truth
Prediction    0    1
         0 4036   62
         1   28  349

The results show a final accuracy of nearly 98%. Part of this high accuracy could be due to the imbalanced nature of the pulsar data: looking at the confusion matrix, the class column is largely comprised of “0” values, with roughly a 10:1 ratio of “0” to “1”.

Under these circumstances, our group tried to balance the data with the step_upsample function, hoping to oversample the “1” class. However, the function failed with an error we could not resolve. When we consulted our teaching assistant, we were told that balancing the data was not required, leaving us with nearly 98% accuracy at a K value of 16.
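For reference, the balanced recipe we attempted would look roughly like the sketch below. step_upsample() comes from the themis package, which is not loaded in this notebook, so this code is illustrative only and is not part of the final analysis:

```r
# illustrative sketch: class balancing via upsampling (requires the themis package)
library(themis)

balanced_recipe <- recipe(class ~ profile_kurtosis + profile_skewness + profile_mean,
                          data = pulsar_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors()) |>
    # duplicate minority-class rows until both classes have equal counts
    step_upsample(class, over_ratio = 1)
```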

7. Final visualization¶

Finally, we will use three scatter plots to visualize the relationships between our predictor variables as the final visualization of this project.

In [15]:
final_plot_1 <- ggplot(pulsar_predictions, aes(x = profile_mean, y = profile_kurtosis, color = .pred_class)) +
    geom_point(alpha = 0.2) +
    xlab("Profile Mean") +
    ylab("Profile Kurtosis") +
    labs(color = "Predicted Class") +
    ggtitle("Figure 4: Plot of profile kurtosis against profile mean")

final_plot_2 <- ggplot(pulsar_predictions, aes(x = profile_mean, y = profile_skewness, color = .pred_class)) +
    geom_point(alpha = 0.2) +
    xlab("Profile Mean") +
    ylab("Profile Skewness") +
    labs(color = "Predicted Class") +
    ggtitle("Figure 5: Plot of profile skewness against profile mean")

final_plot_3 <- ggplot(pulsar_predictions, aes(x = profile_skewness, y = profile_kurtosis, color = .pred_class)) +
    geom_point(alpha = 0.2) +
    xlab("Profile Skewness") +
    ylab("Profile Kurtosis") +
    labs(color = "Predicted Class") +
    ggtitle("Figure 6: Plot of profile kurtosis against profile skewness")

final_plot_1
final_plot_2
final_plot_3

We can see a positive relationship between profile skewness and profile kurtosis, while profile mean appears negatively related to both. All three relationships appear non-linear. Also, the observations classified as pulsars by the model tend to have larger profile skewness and kurtosis values but a smaller profile mean.

Discussion¶

Summarizing what we found¶

We found that observations with a low profile mean, high profile kurtosis, and high profile skewness are most likely to be pulsars.
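To illustrate, the fitted model can classify a hypothetical observation matching this profile. The values below are invented for demonstration (loosely based on the class-1 means in our summary table), not taken from the dataset:

```r
# hypothetical star: low profile mean, high profile skewness and kurtosis
hypothetical_star <- tibble(profile_mean = 57,
                            profile_skewness = 3.1,
                            profile_kurtosis = 15.4)

# returns the predicted class for this observation
predict(knn_fit, hypothetical_star)
```

Given our findings, we would expect such an observation to be classified as a pulsar.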

This was the conclusion we expected when we wrote the group proposal: the correlation heatmap showed a strong correlation between these attributes and whether a star is a pulsar, so we already knew some kind of relationship existed.

Impacts of these findings¶

If a situation arises where resources are restricted and only a limited number of attributes can be collected and measured, the collection of certain information may have to be prioritized over others. Our model suggests that the three attributes profile mean, profile kurtosis, and profile skewness should be prioritized, as they were sufficient for determining the nature of each observation. Furthermore, there are scientific implications to our capacity to classify stars as pulsars. Information on pulsars is valuable in astronomy because pulsar timing can be used to detect gravitational waves and because pulsars contribute to the study of stellar evolution (3).

Future questions¶

These could be possible additional questions that can be explored regarding this topic:

  • Do the attributes of a star change over time? If so, how does this affect its potential classification as a pulsar?

  • Are there different types of pulsars that the attributes of this model do not account for? How can we be sure that we can distinguish between a non-pulsar star and every potential variant of a pulsar star? (If there are different types of pulsar stars, this binary classification problem can be extended to a multiclass classification problem to predict the types of pulsar stars as well)

  • What other attributes could potentially reveal the nature of a star?

References¶

(1) Lea, Robert. “What Are Pulsars?” Space.com, 22 Apr. 2016, https://www.space.com/32661-pulsars.html.

(2) Lyon, Robert. (2017). HTRU2. UCI Machine Learning Repository. https://doi.org/10.24432/C5DK6R.

(3) Rini, Matteo. “Researchers Capture Gravitational-Wave Background with Pulsar ‘Antennae.’” Physics, American Physical Society, 29 June 2023, https://physics.aps.org/articles/v16/118.